library(CARBayesdata)
data(pollutionhealthdata)
data(GGHB.IZ)In this practical we will:
- Explore tools for areal spatial data wrangling and visualization.
- Compute Morans’I to identify spatial autocorrelation in our data.
Areal (lattice) data
Areal data our measurements are summarised across a set of discrete, non-overlapping spatial units such as postcode areas, health board or pixels on a satellite image. In consequence, the spatial domain is a countable collection of (regular or irregular) areal units at which variables are observed. Many public health studies use data aggregated over groups rather than data on individuals - often this is for privacy reasons, but it may also be for convenience.
In the next example we are going to explore data on respiratory hospitalisations for Greater Glasgow and Clyde between 2007 and 2011. The data are available from the CARBayesdata R Package:
The pollutionhealthdata contains the spatiotemporal data on respiratory hospitalisations, air pollution concentrations and socio-economic deprivation covariates for the 271 Intermediate Zones (IZ) that make up the Greater Glasgow and Clyde health board in Scotland. Data are provided by the Scottish Government and the available variables are:
IZ: unique identifier for each IZ.year: the year were the measruments were takenobserved: observed numbers of hospitalisations due to respiratory disease.expected: expected numbers of hospitalisations due to respiratory disease computed using indirect standardisation from Scotland-wide respiratory hospitalisation rates.pm10: Average particulate matter (less than 10 microns) concentrations.jsa: The percentage of working age people who are in receipt of Job Seekers Allowanceprice: Average property price (divided by 100,000).
The GGHB.IZ data is a Simple Features (sf) object containing the spatial polygon information for the set of 271 Intermediate Zones (IZ), that make up of the Greater Glasgow and Clyde health board in Scotland ( Figure 1 ).
Let’s start by loading useful libraries:
library(sf)
library(ggplot2)
library(scico)The sf package allows us to work with vector data which is used to represent points, lines, and polygons. It can also be used to read vector data stored as a shapefiles.
First, lets combine both data sets based on the Intermediate Zones (IZ) variable using the merge function from base R:
resp_cases <- merge(GGHB.IZ, pollutionhealthdata, by = "IZ")In epidemiology, disease risk is usually estimated using Standardized Mortality Ratios (SMR). The SMR for a given spatial areal unit \(i\) is defined as the ratio between the observed ( \(Y_i\) ) and expected ( \(E_i\) ) number of cases:
\[ SMR_i = \dfrac{Y_i}{E_i} \]
A value \(SMR > 1\) indicates that there are more observed cases than expected which corresponds to a high risk area. On the other hand, if \(SMR<1\) then there are fewer observed cases than expected, suggesting a low risk area.
We can manipulate sf objects the same way we manipulate standard data frame objects via the dplyr package. Lets use the pipeline command %>% and the mutate function to calculate the yearly SMR values for each IZ:
library(dplyr)
resp_cases <- resp_cases %>%
mutate(SMR = observed/expected, .by = year )Now we use ggplot to visualize our data by adding a geom_sf layer and coloring it according to our variable of interest (i.e., SMR). We can further use facet_wrap to create a layer per year and chose an appropriate color palette using the scale_fill_scico from the scico package:
ggplot()+
geom_sf(data=resp_cases,aes(fill=SMR))+
facet_wrap(~year)+scale_fill_scico(palette = "roma")As with the other types of spatial modelling, our goal is to observe and explain spatial variation in our data. Generally, we are aiming to produce a smoothed map which summarises the spatial patterns we observe in our data.
A key aspect of any spatial analysis is that observations closer together in space are likely to have more in common than those further apart. This can lead us towards approaches similar to those used in time series, where we consider the spatial closeness of our regions in terms of a neighbourhood structure.
The function poly2nb() of the spdep package can be used to construct a list of neighbors based on areas with contiguous boundaries (e.g., using Queen contiguity).
library(spdep)
W.nb <- poly2nb(GGHB.IZ,queen = TRUE)
W.nbNeighbour list object:
Number of regions: 271
Number of nonzero links: 1424
Percentage nonzero weights: 1.938971
Average number of links: 5.254613
2 disjoint connected subgraphs
The warning tell us that the neighbourhood is comprised of two interconnected regions. By looking at the neighbourhood graph below, we can see that these are the North and South Glasgow regions which are separated by the River Clyde.
plot(st_geometry(GGHB.IZ), border = "lightgray")
plot.nb(W.nb, st_geometry(GGHB.IZ), add = TRUE)You could use the snap argument within poly2nb to set a distance at which the different regions centroids are consider neighbours. To do so we first need to be aware about the spatial units of the spatial coordinate reference system (CRS). We can check this as follows:
st_crs(GGHB.IZ)$units[1] "m"
Then, we could set a distance of 250m to join the IZ centroids that are are less than 250m apart.
W.nb250 <- poly2nb(GGHB.IZ,snap=250)
W.nb250Neighbour list object:
Number of regions: 271
Number of nonzero links: 1758
Percentage nonzero weights: 2.393758
Average number of links: 6.487085
plot(st_geometry(GGHB.IZ), border = "lightgray")
plot.nb(W.nb250, st_geometry(GGHB.IZ), add = TRUE)Once we have identified a set of neighbours using our chosen method, we can use this to account for correlation.
Moran’s \(I\) is a measure of global spatial autocorrelation, and can be considered an extension of the Pearson correlation coefficient. For a set of data \(Z_1, \ldots, Z_m\) measured on regions \(B_1, \ldots B_m\), with neighbourhood matrix \(W\), we can compute Moran’s I as:
\[ I = \frac{m}{\sum_{i=1}^m \sum_{j=1}^m w_{ij}}\frac{\sum_{i=1}^m \sum_{j=1}^m w_{ij} (Z_i - \bar{Z})(Z_j - \bar{Z})}{\sum_{i=1}^m (Z_i - \bar{Z})^2} \]
This is basically a function of differences in values between neighbouring areas. By far the most common approach is to use a binary neighbourhood matrix, \(W\), denoted by
\[ w_{ij} = \begin{cases} 1 & \text{if areas } (B_i, B_j) \text{ are neighbours.}\\ 0 & \text{otherwise.} \end{cases} \]
Binary matrices are used for their simplicity. Fitting spatial models often requires us to invert \(W\), and this is less computationally intensive for sparse matrices.
Moran’s \(I\) ranges between -1 and 1, and can be interpreted in a similar way to a standard correlation coefficient.
\(I=1\) implies that we have perfect spatial correlation.
\(I=0\) implies that we have complete spatial randomness.
\(I=-1\) implies that we have perfect dispersion (negative correlation).
Our observed \(I\) is a point estimate, and we may also wish to assess whether it is significantly different from zero. We can test for a statistically significant spatial correlation using a permutation test, with hypotheses:
\[ \begin{aligned} H_0&: \text{ negative or no spatial association } (I \leq 0)\\ H_1&: \text{ positive spatial association } (I > 0) \end{aligned} \]
We can use moran.test() to test this hypothesis by setting alternative = "greater". To do so, we need to supply list containing the neighbors via the nb2listw() function from the spdep package. Lets assess now the spatial autocorrelation of the SMR in 2011:
# subset the data
resp_cases_2011 <- resp_cases %>% filter(year ==2011)
# neighbors list
nbw <- nb2listw(W.nb, style = "W")
# Global Moran's I
gmoran <- moran.test(resp_cases_2011$SMR, nbw,
alternative = "greater")
gmoran
Moran I test under randomisation
data: resp_cases_2011$SMR
weights: nbw
Moran I statistic standard deviate = 11.42, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic Expectation Variance
0.439899780 -0.003703704 0.001508809
A local version of Moran’s I can also be used to measure the similarity between each IZ via the localmoran function:
lmoran <- localmoran(resp_cases_2011$SMR, nbw, alternative = "two.sided")In this case we set alternative = "two.sided" to test whether there is any evidence of spatial autocorrelation in our data:
\[ \begin{aligned} H_0&: \text{ no spatial association } (I=0)\\ H_1&: \text{ some spatial association } (I \neq 0) \end{aligned} \]
We can obtain the \(Z\)-scores from the test (Z.Ii) where any values smaller than –1.96 indicate negative spatial autocorrelation, and z-score values greater than 1.96 indicate positive spatial autocorrelation:
resp_cases_2011_m <- resp_cases_2011 %>% mutate(Zscores = lmoran[,"Z.Ii"])
resp_cases_2011_m <- resp_cases_2011_m %>%
mutate (SAC = case_when(
Zscores > qnorm(0.975) ~ " M > 1" ,
Zscores < -1*qnorm(0.975) ~ " M < 1",
.default = "M = 0"),
SAC=as.factor(SAC)
)We can visualize these results using either ggplot as we did before, or via the mapview library which contains interactive features:
library(mapview)
mapview(resp_cases_2011_m,
zcol = "SAC",
layer.name = "SAC",
col.regions=c("turquoise","orange","grey40"))